1 Abstract

1.1 Topic

Our topic is the real estate industry. In this project, we use King County as an example to examine house prices and how they vary with certain features. We then use our data analysis to gain business insight into real estate.

1.2 Motivation and Background

Real estate has always been a worthwhile investment. However, changes in house prices are influenced by many factors. From an investor’s business perspective, deciding which properties are worth investing in is a question worth considering.

According to an article in the Wall Street Journal (reference 1), some attributes, such as house size and number of bedrooms, are not as significant as before, while location and community are gradually becoming the dominant factors. We also found an Opendoor article (reference 2) that lists 8 critical factors that influence a home’s value, including neighborhood, location, and upgrades.

1.3 Objectives

We want to put ourselves in the shoes of real estate investors, both individuals and companies like Airbnb, who want to maximize the profit from their properties. To do that, we have to spot the houses with the highest growth potential. We look for such houses by analyzing the relationships between price and other important house variables. Our guiding questions are:

What factors will influence the house valuation?
What kind of houses are worth investing in?

1.4 Stakeholders

Some stakeholders may find our data analysis useful for their business and decision making. We created a Stakeholder Analysis Worksheet to show this:

1.5 Hypothesis

Based on our research and thought process, we identified the following testable hypotheses:

  1. If a property has all the good features, such as more bedrooms, a large basement, and recent construction, it will have a higher price.
  2. All features of a property will be good predictors of its market price.
  3. If investors’ properties have all the features seen as strongly relevant, they will gain more profit on those properties in the future.

1.6 Plan

Next, we will explore our data and look for patterns using exploratory data analysis, including a map that shows the distribution of the houses and their prices, and plots that show the potential relationships between features and price. In the last part, we will build a model that describes the relationship between house price and features. We will also conduct a hypothesis test and draw inferences.

2 Data

2.1 Data Source

This dataset contains house sale prices for King County, which includes major cities such as Seattle. We decided to use this dataset because King County has a large wealth disparity: some of the world’s richest people live there, and yet 10% of the county’s residents live in poverty. We believe this makes it broadly representative of many regions in the US, or even the world.

Dataset URL

2.2 Variable Selection

For variable selection, we created an interactive heatmap of the correlations between variables (you can point to each cell to see the correlation value), since we would like to know which factors are meaningfully related to house prices.

Based on our findings and visualizations, we decided to use the following variables because they show meaningful correlations with price:

Notice: We did not choose sqft_basement and sqft_above because their sum is sqft_living, so we use sqft_living to represent them. We also dropped variables such as condition and yr_built, which have very low correlations with price.

price: House price
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
sqft_living: Square footage of living space
waterfront: Whether the house has a view of a waterfront - “0” means no, “1” means yes
floors: Floors (levels) in the house
grade: Overall grade given to the housing unit, based on the King County grading system
sqft_living15: The square footage of interior housing living space for the nearest 15 neighbors
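The correlation screen described in this section can be sketched in base R. This is only an illustration: the data frame `demo` is simulated stand-in data (not the King County file), and only the column names follow the dataset. It ranks variables by their absolute correlation with price and checks the identity used to drop sqft_above and sqft_basement.

```r
# Simulated stand-in for kchouse (hypothetical values, not the real data)
set.seed(1)
n <- 200
sqft_above    <- round(runif(n, 600, 3000))
sqft_basement <- round(runif(n, 0, 1200))
demo <- data.frame(
  sqft_above    = sqft_above,
  sqft_basement = sqft_basement,
  sqft_living   = sqft_above + sqft_basement,
  grade         = sample(5:12, n, replace = TRUE)
)
demo$price <- 150 * demo$sqft_living + 40000 * demo$grade + rnorm(n, sd = 50000)

# Correlation of every column with price, ranked from strongest to weakest
cors <- cor(demo)[, "price"]
sort(abs(cors[names(cors) != "price"]), decreasing = TRUE)

# Check the identity that lets sqft_living stand in for the two dropped columns
all(demo$sqft_living == demo$sqft_above + demo$sqft_basement)
```

On the real data, the same two calls would be applied to the numeric columns of kchouse.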

2.3 Check Data Quality

2.3.1 Missing values

We use the code below to count the missing values in our dataset. The result is “0”, which means there are no missing values in our dataset.

# Use sum() to show the amount of missing values
sum(is.na(kchouse))
## [1] 0
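When the total is not zero, it helps to know where the gaps are. The sketch below uses a hypothetical toy data frame (`toy` is not part of our dataset) to show the total count, the per-column count, and the affected rows.

```r
# Hypothetical toy data frame with two missing entries
toy <- data.frame(price    = c(221900, NA, 180000),
                  bedrooms = c(3, 3, NA))

sum(is.na(toy))              # total number of missing cells
colSums(is.na(toy))          # missing cells per column
toy[!complete.cases(toy), ]  # rows with at least one missing value
```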

2.3.2 Outliers

In this section, we check for outliers in our selected variables using boxplots. For some variables, we do not have enough evidence or domain knowledge to decide whether extreme values should be removed. Therefore, we only examine the variables for which we can make that decision with confidence.

Firstly, we will check the bedrooms column:

Notice: According to the boxplot above, there is a house with 33 bedrooms. To decide whether this value should be dropped, we looked up the listing for this house and found that the value was a typo. Therefore, we decided to drop this observation.

# Remove the house with 33 bedrooms
kchouse <- subset(kchouse, bedrooms != 33)

Secondly, we will check the bathrooms column:

Lastly, let’s check the grade column:

boxplot(kchouse$grade, xlab = "grade", horizontal = TRUE)

Indeed, there are many outliers in our data. We used boxplots to visualize and identify them. We decided to keep them all except the house with 33 bedrooms, because the other values do not seem to be mismeasured or randomly generated. House prices, like stock prices, are levels that both parties to a transaction agreed on, so the recorded values reflect how the trades actually happened.
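The points a boxplot draws beyond its whiskers can also be listed directly. The sketch below uses a small hypothetical vector of bedroom counts (not our data) and `boxplot.stats()`, which applies the same 1.5 * IQR rule that `boxplot()` uses by default.

```r
# Hypothetical bedroom counts, including one implausible value
bedrooms <- c(2, 3, 3, 3, 4, 4, 4, 5, 3, 2, 4, 3, 33)

# Values beyond the whiskers (more than 1.5 * IQR outside the hinges)
flagged <- boxplot.stats(bedrooms)$out
flagged

# Drop only the value judged to be a data-entry error
cleaned <- bedrooms[bedrooms != 33]
length(cleaned)
```

Listing the flagged values first makes it easier to review each one before deciding whether to drop it.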

3 Exploratory data analysis

In this part, we make some plots to look for patterns that are relevant to our hypotheses.

3.1 Price Distribution

First, we show the distribution of price, which is our most important variable. We want to know whether it is approximately normal and whether it is suitable for our follow-up analysis.

We can see that the price distribution is right skewed. To make relationships in the data easier to find in the coming analysis and modelling, we decided to use log(price). We calculated the log of price, stored it as a new variable, log_price, in the dataset kchouse, and then plotted the new column.

Now, the distribution is much closer to a normal distribution.
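A minimal sketch of this transformation, assuming the new column is created with the natural log. The prices here are simulated log-normal draws rather than the real data, and the small `skewness()` helper is our own illustration, not a function from a package.

```r
# Simulated right-skewed prices (log-normal draws, not the real data)
set.seed(42)
demo <- data.frame(price = exp(rnorm(5000, mean = 13, sd = 0.5)))

# New column holding the natural log of price
demo$log_price <- log(demo$price)

# Sample skewness: clearly positive for a right-skewed variable, near 0 for a symmetric one
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(price = skewness(demo$price), log_price = skewness(demo$log_price))
```

On the real data, the same assignment would be `kchouse$log_price <- log(kchouse$price)`.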

3.2 Price Variation Map

Here, we use an interactive map to show the price distribution in different colors. The map clearly shows that houses in different locations have different prices. For example, houses near the airport have the lowest prices, while houses near lakes or in the suburbs have higher prices. These findings prompted us to ask whether location factors such as waterfront influence house prices.

3.3 Bedrooms and Price

Based on the bar plot below, we can examine how the number of bedrooms relates to the house price. The plot shows that house price increases as the number of bedrooms increases from 1 to 8, which fits our first hypothesis. However, beyond 8 bedrooms, prices decrease. Therefore, we have to conduct a deeper analysis to reveal the true relationship between bedrooms and price.

3.4 Bathrooms and Price

Based on the bar plot below, we can examine how the number of bathrooms relates to the house price. The plot shows that house price increases as the number of bathrooms increases, which is consistent with our first hypothesis and encourages us to continue our analysis.

3.5 Living space and Price

The plot below is very important because it helps us visualize how living space relates to house valuation. Based on this plot, we found that the correlation between living space and house price is positive and quite strong. We will test this relationship using regression in the modeling part.

3.6 Waterfront and Price

Waterfront is a very interesting variable because, in our background reading, we found that people are willing to pay more for houses with a waterfront view. Therefore, we would like to check whether the prices of waterfront houses are higher than those of houses without a water view. The bar plot is consistent with this expectation: the waterfront houses, which are represented by 1, have higher prices than the non-waterfront houses, which are represented by 0.
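A bar plot compares group averages visually; a two-sample test puts a number on the difference. The sketch below is one way this could be done, using a Welch t-test on simulated data (the values in `demo` are hypothetical and only loosely mimic the scale of log prices).

```r
# Simulated stand-in data: 950 non-waterfront and 50 waterfront houses
set.seed(3)
demo <- data.frame(waterfront = rep(c(0, 1), times = c(950, 50)))
demo$log_price <- 13.04 + 1.06 * demo$waterfront + rnorm(1000, sd = 0.5)

# Group means, the quantities a bar plot of average price would show
aggregate(log_price ~ waterfront, data = demo, FUN = mean)

# Welch t-test; "less" tests mean(waterfront = 0) < mean(waterfront = 1)
tt <- t.test(log_price ~ waterfront, data = demo, alternative = "less")
tt$p.value
```

The one-sided alternative matches the direction of the claim, namely that waterfront houses are more expensive.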

3.7 Floors and Price

If you wonder what a 0.5 floor level is, it means the top floor sits under a sloped roof, so its usable space is about 50% less than that of a full floor. The number of floors can matter to buyers who care about the height of their house. In the plot below, the pattern is not as clear as for the other variables, and the upper end of the range shows little difference. Still, we can tell that more floors generally means a higher price.

3.8 Grade and Price

King County has its own grading system for houses. A higher grade indicates better architectural design, better materials, higher build quality, etc. From the bar plot below, we can clearly see that house price increases as the grade increases, which matches our hypothesis.

3.9 sqft_living15 and Price

sqft_living15 is the square footage of interior housing living space for the nearest 15 neighbors. This chart is very important because it helps us visualize two key points. First, the orange dots below the blue line show the houses/units that are cheaper when we look at prices individually. Second, there are some very expensive houses at the top. If we hold the x-value constant at the level of the expensive houses and look at the lowest y-values, we find the cheap houses with growth potential.
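The idea of looking for points below the line can be made concrete with residuals. The sketch below assumes the blue line is a fitted regression line (the report does not state how it was drawn) and uses simulated data, so `demo`, `fit`, and `candidates` are illustrative names only.

```r
# Simulated stand-in data (hypothetical values)
set.seed(7)
n <- 500
demo <- data.frame(sqft_living15 = runif(n, 800, 4000))
demo$log_price <- 12.1 + 4.8e-04 * demo$sqft_living15 + rnorm(n, sd = 0.4)

# Fitted line playing the role of the blue line in the chart
fit <- lm(log_price ~ sqft_living15, data = demo)
demo$resid <- residuals(fit)

# The ten houses priced furthest below what their neighbors' size predicts
candidates <- demo[order(demo$resid), ][1:10, ]
candidates
```

A large negative residual only says a house is cheap relative to this one predictor; other features may explain the gap.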

4 Update and Modifications

In our hypotheses and thought process, we assumed that the year a house was built and its condition could be important factors influencing its price. However, after visualizing the correlations between these variables and price, we noticed that some factors are not as significant as we had thought. Hence, we dropped the variables with extremely low correlations.

5 Analysis plan

In the next section, we run simple and multiple linear regressions and take a closer look at the coefficients, adjusted R-squared, p-values, residuals, etc. Then, we carry out a hypothesis test for the regression. In the end, we select the best model for our case.

6 Modelling

6.1 Simple Regression Check

First, we run a linear regression on each variable separately. Although simple regressions are not always necessary and their results may not be reliable, they are still helpful for a preliminary exploration of the relationship between each independent variable and the dependent variable.

summary(lm(log_price ~ bedrooms,kchouse))
## 
## Call:
## lm(formula = log_price ~ bedrooms, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.66326 -0.35771 -0.00512  0.31436  2.39052 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.361753   0.012894   958.7   <2e-16 ***
## bedrooms     0.203607   0.003695    55.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4932 on 21610 degrees of freedom
## Multiple R-squared:  0.1232, Adjusted R-squared:  0.1232 
## F-statistic:  3037 on 1 and 21610 DF,  p-value: < 2.2e-16
summary(lm(log_price ~ bathrooms,kchouse))
## 
## Call:
## lm(formula = log_price ~ bathrooms, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05933 -0.32650 -0.00039  0.29271  2.09235 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.251198   0.008738 1401.98   <2e-16 ***
## bathrooms    0.376685   0.003883   97.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4396 on 21610 degrees of freedom
## Multiple R-squared:  0.3034, Adjusted R-squared:  0.3034 
## F-statistic:  9412 on 1 and 21610 DF,  p-value: < 2.2e-16
summary(lm(log_price ~ sqft_living,kchouse))
## 
## Call:
## lm(formula = log_price ~ sqft_living, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.97793 -0.28542  0.01473  0.26071  1.27629 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.222e+01  6.374e-03  1916.9   <2e-16 ***
## sqft_living 3.988e-04  2.803e-06   142.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3785 on 21610 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4835 
## F-statistic: 2.023e+04 on 1 and 21610 DF,  p-value: < 2.2e-16
summary(lm(log_price ~ waterfront,kchouse))
## 
## Call:
## lm(formula = log_price ~ waterfront, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81454 -0.36371 -0.02278  0.32944  2.81694 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.039786   0.003541 3682.38   <2e-16 ***
## waterfront   1.062832   0.040775   26.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5186 on 21610 degrees of freedom
## Multiple R-squared:  0.03048,    Adjusted R-squared:  0.03044 
## F-statistic: 679.4 on 1 and 21610 DF,  p-value: < 2.2e-16
summary(lm(log_price ~ floors,kchouse))
## 
## Call:
## lm(formula = log_price ~ floors, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.79343 -0.37170 -0.01141  0.31963  2.56932 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.595104   0.010021 1256.87   <2e-16 ***
## floors       0.302943   0.006307   48.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5007 on 21610 degrees of freedom
## Multiple R-squared:  0.09647,    Adjusted R-squared:  0.09643 
## F-statistic:  2307 on 1 and 21610 DF,  p-value: < 2.2e-16
summary(lm(log_price ~ grade,kchouse))
## 
## Call:
## lm(formula = log_price ~ grade, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.43312 -0.26305 -0.00601  0.24205  1.78121 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.633680   0.016777   633.8   <2e-16 ***
## grade        0.315287   0.002166   145.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3742 on 21610 degrees of freedom
## Multiple R-squared:  0.4951, Adjusted R-squared:  0.4951 
## F-statistic: 2.119e+04 on 1 and 21610 DF,  p-value: < 2.2e-16
summary(lm(log_price ~ sqft_living15,kchouse))
## 
## Call:
## lm(formula = log_price ~ sqft_living15, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.89287 -0.29882 -0.00761  0.26045  1.95090 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.210e+01  8.625e-03    1403   <2e-16 ***
## sqft_living15 4.759e-04  4.104e-06     116   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4135 on 21610 degrees of freedom
## Multiple R-squared:  0.3836, Adjusted R-squared:  0.3836 
## F-statistic: 1.345e+04 on 1 and 21610 DF,  p-value: < 2.2e-16

6.2 Multiple Regression

Here, we use log_price as our dependent variable to reduce the right skew of price and the risk of heteroscedasticity. We then run a multiple regression as follows:

model <- lm(formula = log_price ~ bedrooms + bathrooms + sqft_living + waterfront + floors + grade + sqft_living15, data = kchouse)

summary(model)
## 
## Call:
## lm(formula = log_price ~ bedrooms + bathrooms + sqft_living + 
##     waterfront + floors + grade + sqft_living15, data = kchouse)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5343 -0.2525  0.0015  0.2387  1.2282 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.125e+01  2.096e-02 536.522  < 2e-16 ***
## bedrooms      -1.491e-02  3.337e-03  -4.466 8.01e-06 ***
## bathrooms     -4.135e-03  5.174e-03  -0.799    0.424    
## sqft_living    1.983e-04  5.574e-06  35.584  < 2e-16 ***
## waterfront     6.053e-01  2.742e-02  22.072  < 2e-16 ***
## floors        -7.543e-03  5.228e-03  -1.443    0.149    
## grade          1.717e-01  3.522e-03  48.745  < 2e-16 ***
## sqft_living15  7.059e-05  5.571e-06  12.672  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3456 on 21604 degrees of freedom
## Multiple R-squared:  0.5696, Adjusted R-squared:  0.5694 
## F-statistic:  4084 on 7 and 21604 DF,  p-value: < 2.2e-16

After the simple and multiple regression analyses, we found some interesting points that help us improve our model. Bathrooms and floors are statistically significant in the simple regressions, with p-values very close to 0. However, in the multiple regression both have large p-values (0.424 and 0.149), which indicates that their effects are not statistically significant once the other variables are included. We then checked the correlations between these two variables and the other independent variables. We found that bathrooms and floors are positively correlated with other independent variables such as sqft_living and grade, which means that their apparent effect on the dependent variable in the univariate analysis also absorbs the positive effect of those other variables. Hence, we remove bathrooms and floors and build a new model.
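The correlation check among predictors can be sketched as follows. The data are simulated so that bathrooms and grade both move with sqft_living (hypothetical values, not our dataset), and the variance inflation factor is computed by hand as 1 / (1 - R^2) rather than with a package function.

```r
# Simulated predictors that are correlated by construction (hypothetical values)
set.seed(11)
n <- 1000
sqft_living <- runif(n, 500, 5000)
demo <- data.frame(
  sqft_living = sqft_living,
  bathrooms   = 0.5 + sqft_living / 1000 + rnorm(n, sd = 0.4),
  grade       = 5 + sqft_living / 1000 + rnorm(n, sd = 0.8)
)

# Pairwise correlations among the predictors
round(cor(demo), 2)

# Variance inflation factor for bathrooms: regress it on the other predictors
r2 <- summary(lm(bathrooms ~ sqft_living + grade, data = demo))$r.squared
vif_bathrooms <- 1 / (1 - r2)
vif_bathrooms
```

A high value signals that a predictor is largely explained by the others, which is one reason its coefficient can lose significance in a multiple regression.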

new_model <- lm(formula = log_price ~ bedrooms + sqft_living + waterfront + grade + sqft_living15, data = kchouse)
summary(new_model)
## 
## Call:
## lm(formula = log_price ~ bedrooms + sqft_living + waterfront + 
##     grade + sqft_living15, data = kchouse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53648 -0.25279  0.00168  0.23809  1.22382 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.125e+01  2.086e-02 539.143  < 2e-16 ***
## bedrooms      -1.550e-02  3.268e-03  -4.744 2.11e-06 ***
## sqft_living    1.962e-04  5.137e-06  38.204  < 2e-16 ***
## waterfront     6.061e-01  2.742e-02  22.102  < 2e-16 ***
## grade          1.692e-01  3.301e-03  51.267  < 2e-16 ***
## sqft_living15  7.170e-05  5.543e-06  12.935  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3456 on 21606 degrees of freedom
## Multiple R-squared:  0.5695, Adjusted R-squared:  0.5694 
## F-statistic:  5716 on 5 and 21606 DF,  p-value: < 2.2e-16
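Dropping variables can also be checked formally by comparing the nested models with a partial F-test. The sketch below shows the mechanics with `anova()` on simulated data in which floors has no true effect; `full`, `reduced`, and `demo` are illustrative names, not objects from our analysis.

```r
# Simulated data where floors has no true effect on log_price
set.seed(5)
n <- 800
demo <- data.frame(
  sqft_living = runif(n, 500, 5000),
  grade       = sample(5:12, n, replace = TRUE),
  floors      = sample(c(1, 1.5, 2), n, replace = TRUE)
)
demo$log_price <- 11 + 2e-04 * demo$sqft_living + 0.17 * demo$grade +
  rnorm(n, sd = 0.35)

full    <- lm(log_price ~ sqft_living + grade + floors, data = demo)
reduced <- lm(log_price ~ sqft_living + grade, data = demo)

# Partial F-test: a large p-value means the dropped term adds little explanatory power
cmp <- anova(reduced, full)
cmp
```

With the models in this report, the corresponding call would be `anova(new_model, model)`.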

6.3 Interpretation

Now, we have the coefficients of our model:

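\[ \begin{aligned}
&\beta_0 = 11.25 \\
&\beta_1 = -1.550 \times 10^{-2} \quad (\mathrm{bedrooms}) \\
&\beta_2 = 1.962 \times 10^{-4} \quad (\mathrm{sqft\_living}) \\
&\beta_3 = 6.061 \times 10^{-1} \quad (\mathrm{waterfront}) \\
&\beta_4 = 1.692 \times 10^{-1} \quad (\mathrm{grade}) \\
&\beta_5 = 7.170 \times 10^{-5} \quad (\mathrm{sqft\_living15})
\end{aligned} \]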

First, the adjusted R-squared is 0.5694. It means that about 56.94% of the variation in log_price is explained by the fitted model.

Since our dependent variable is log_price while the predictors are in their original units, we interpret the coefficients as approximate percentage changes in price. For example, if sqft_living increases by 1 square foot, the house price increases by about 0.0196% (100 × 1.962e-04), holding the other variables constant. Note that the coefficient of bedrooms is negative. We still keep it because a coefficient in a multiple regression measures the effect of one variable with the others held fixed: for a given living space and grade, an additional bedroom is associated with a slightly lower price.
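In a model with a logged response and unlogged predictors, the exact percentage change in price for a one-unit increase in a predictor is 100 × (exp(β) − 1), which is close to 100 × β when β is small. The short calculation below applies this to the coefficients reported in the new_model summary; the vector name `b` is ours.

```r
# Coefficients copied from the summary(new_model) output
b <- c(bedrooms = -1.550e-02, sqft_living = 1.962e-04, waterfront = 6.061e-01,
       grade = 1.692e-01, sqft_living15 = 7.170e-05)

# Exact percent change in price for a one-unit increase in each predictor
pct <- 100 * (exp(b) - 1)
round(pct, 4)

# Percent change in price for 100 extra square feet of living space
pct_100sqft <- 100 * (exp(100 * b[["sqft_living"]]) - 1)
pct_100sqft
```

For large coefficients such as waterfront, the exact formula matters: the approximation 100 × β would understate the effect noticeably.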

6.4 Hypothesis Test

Now, we will analyze our model to see if it is good. First, we conduct a hypothesis test. Our null and alternative hypotheses are shown below:

\[ \begin{aligned} H_0 &: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0 \\ H_A &: \text{at least one } \beta_j \neq 0, \quad j = 1, \dots, 5 \end{aligned} \]

To test the null hypothesis, we use the overall F-test reported in the model summary: the F-statistic is 5716 on 5 and 21606 degrees of freedom, with a p-value below 2.2e-16. This gives us strong evidence to reject the null hypothesis. In addition, the p-value of each individual coefficient is very close to 0, so each effect is statistically significant. We can therefore say that these independent variables are related to the dependent variable, log_price.
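The joint null hypothesis that all slopes are zero is assessed with the overall F-statistic. As a small worked check, the snippet below takes the F-statistic and degrees of freedom printed in the new_model summary and recomputes the 5% critical value and the upper-tail p-value; the variable names are ours.

```r
# Values copied from the summary(new_model) output
f_stat <- 5716
df1    <- 5      # number of slope coefficients tested
df2    <- 21606  # residual degrees of freedom

# 5% critical value of the F distribution and the upper-tail p-value
f_crit <- qf(0.95, df1, df2)
p_val  <- pf(f_stat, df1, df2, lower.tail = FALSE)

c(critical_value = f_crit, p_value = p_val)
f_stat > f_crit  # TRUE means the null hypothesis is rejected at the 5% level
```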

6.5 Residuals Test

We can see in the figure below that the residuals are approximately normally distributed and centered at 0. This suggests that the fit is not driven by a few special values, so our regression model is reasonable.

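library(broom)    # augment()
library(dplyr)    # %>% and slice_sample()
library(ggplot2)  # ggplot() and geom_density()

set.seed(123)
augment(new_model) %>%
  slice_sample(n = 1000) %>% # Use a sample instead of all rows in the dataset
  ggplot(aes(x = .resid)) +
  geom_density(fill = "skyblue")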
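A density plot can be complemented with a few numeric checks. The sketch below is an illustration on simulated data (`demo` and `fit` are stand-ins for kchouse and new_model): it summarizes the residuals, measures how close the normal Q-Q points are to a straight line, and compares the average absolute residual across ranges of fitted values.

```r
# Simulated stand-in for the fitted model (hypothetical values)
set.seed(123)
n <- 2000
demo <- data.frame(sqft_living = runif(n, 500, 5000))
demo$log_price <- 12.2 + 4e-04 * demo$sqft_living + rnorm(n, sd = 0.35)
fit <- lm(log_price ~ sqft_living, data = demo)

res <- residuals(fit)
c(mean = mean(res), sd = sd(res))  # the mean is ~0 for any OLS fit with an intercept

# Q-Q coordinates without drawing; drop plot.it = FALSE to see the plot
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)  # values near 1 indicate approximately normal residuals
qq_cor

# Average absolute residual in four bins of fitted values:
# a strong trend across bins would suggest heteroscedasticity
spread <- tapply(abs(res), cut(fitted(fit), 4), mean)
spread
```

Because the residual mean is zero by construction, the shape and spread of the residuals are the more informative diagnostics.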

7 Conclusion

In the above analysis, we identified a combination of factors that influence the price of a house in King County: the number of bedrooms, the size of the living space, waterfront, the grade, and the average living space of the 15 nearest neighboring homes. In the first part, we stated the following hypotheses:

  1. If a property has all the good features, such as more bedrooms, a large basement, and recent construction, it will have a higher price.
  2. All features of a property will be good predictors of its market price.
  3. If investors’ properties have all the features seen as strongly relevant, they will gain more profit on those properties in the future.

Now, we can say that our first hypothesis is not supported, because not all features increase the valuation of a house. The second one is also not supported, because some variables, such as floors and bathrooms, are not good predictors of market price once the other variables are included. The last hypothesis is supported to the extent that we have shown some variables have a strong effect on house prices. So, if investors hold properties with those relevant features, which we can call potential for appreciation, they are likely to make more profit.

8 Key Insights

Now, we can communicate with the stakeholders.

  1. Real estate investors can invest in King County properties with a large living space, a waterfront view, and a very high grade. Those properties have strong potential to appreciate, which means more profit. Airbnb hosts in King County whose houses have these features have more room to raise their prices. They can also highlight these features at the top of their listings on the Airbnb website and app.

  2. For USD cash holders, we suggest parking money in King County houses with a large living space, a waterfront view, and a very high grade, because the U.S. Federal Reserve is printing money at an unprecedented rate and the U.S. currency is losing value. The stock market is risky at all-time highs. In our view, the safest and most crisis-proof way to store value is to convert cash into good assets (houses) with our filtered factors in King County.

  3. During the COVID period, rent went down while housing prices actually went up, and most public places were closed. This only magnifies the importance of our factors. When public gyms are closed, a larger living space can be converted into a home gym or activity/yoga space. When people are in quarantine, a waterfront view helps residents stay in contact with nature. When people stay home more often, a very high grade house offers durability and visual appeal that enhance the home experience. Our recommendation: invest in real estate with our filtered factors and keep smiling when the ship (the economy) goes down.

9 References

1. Forman, Laura. “Real Estate Is Now About Location, Location, Isolation.” Wall Street Journal, 1 Sept. 2020, www.wsj.com/articles/real-estate-is-now-about-location-location-isolation-11598966108?mod=searchresults_pos12&page=1. Accessed 9 Dec. 2020.

2. “8 Critical Factors That Influence a Home’s Value | Opendoor.” Opendoor, 27 Mar. 2019, www.opendoor.com/w/blog/factors-that-influence-home-value.